Abstract

As the availability of cost-effective high-throughput sequencing technology increases, genetic research is beginning 
to focus on identifying the contributions of rare variants (RVs) to complex traits. Using RVs to detect associated 
genes requires statistical approaches that mitigate the lack of power with the analysis of single RVs. Here we report 
the development and application of an approach that aggregates and evaluates the transmissions of RVs in
parent-child trios. An initial score that incorporates the distortion in transmission of the observed RVs from the parents to
their offspring is calculated for each variant. The scores are analyzed using a support vector machine that handles 
these data by mapping the transmission distortion of the multiple RVs into a one-dimensional score in a nonlinear 
fashion when parent-child trios with affected and nonaffected children are contrasted. We refer to this approach as 
Trio-SVM. A total of 275 trios were available in the Genetic Analysis Workshop 18 (GAW18) data for analysis. Because
these trios are not independent and linkage disequilibrium (LD) extends over long distances within pedigrees, Trio-SVM
was vulnerable to type I errors in detecting association. In the GAW18 data with simulated trait values, Trio-SVM had an
appropriate type I error rate but lacked power with a sample of 267 trios. Larger samples of 500 to 1000 trios, derived from
combining the simulated data, provided sufficient power. Two chromosome 3 candidate genes were tested in the
real GAW18 data with Trio-SVM, and they showed marginal associations with hypertension.



Background 

Genome-wide association studies (GWAS) of common 
variants have not explained the heritability estimates of 
common complex disorders [1]. In response, exome 
sequencing, which is designed to reveal rare variants 
(RVs) with a frequency less than a value in the range of 
1% to 5%, is being applied to pursue additional risk 
genes. Interpretation of RVs is best done for Mendelian 
disorders within pedigrees to identify significant loci and 
avoid artifacts of the sequencing process. However, for 
complex disorders and quantitative traits, RVs that segre- 
gate only within a few pedigrees do not provide adequate 
statistical power to implicate a particular gene when they 
are analyzed alone. Approaches to solve this problem 
involve the aggregation of RVs within genes and regions. 
We developed an approach, called Trio-SVM, using the 
support vector machine (SVM) method that aggregates
and tests the RVs of a gene for a dichotomized trait in 
parent-child trios [2]. Parent-child trios are used to test 
association through distortions in transmission from the 
parents to their children. The advantages of this approach
are that (a) the transmission of RVs can be aggregated
across genes and compared with their aggregation in
controls by the SVM, and (b) population stratification is
mitigated because only parents with the RV provide
information in the analysis. That is, the differences in
frequencies of RVs in different ethnic groups will have no
effect on the test statistic because only opportunities for
transmission in parents heterozygous for RVs contribute
to the transmission distortion data used by the SVM.

Using Trio-SVM, all members of the trios are sequenced 
for RVs, and the observed RV transmissions are compared 
with what is expected, given the parental RV genotypes. 
Transmission distortions in a gene are combined using 
SVM. The area under the receiver operating characteristic
curve (area under the curve [AUC]) generated by the SVM
when contrasting the transmission between affected and
unaffected children is estimated for each gene under analysis
and used as the test statistic. A strength of Trio-SVM
is that it allows each RV to either confer risk or
be protective while still contributing to an overall score;
the direction of the effect of each RV is not a factor
in the score.

© 2014 Lu and Cantor; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Lu and Cantor BMC Proceedings 2014, 8(Suppl 1):S98
http://www.biomedcentral.com/1753-6561/8/S1/S98

One potential concern is the availability and choice of 
control groups for the SVM. First, the control sample 
should be beyond the age of risk for the disorder under 
analysis and have appropriate environmental exposures 
when those are known to be important. Second, to pro- 
vide an opportunity for transmission of RVs from par- 
ents to their children, ethnic matching, although not 
necessary, may be helpful. An interesting choice might 
be the unaffected siblings from the trios in the study 
because they would have the same opportunities to 
inherit the RVs that are transmitted to their affected 
siblings. 

Methods 

An overview of support vector machine 

The purpose of the SVM is to discriminate between two
groups using a set of variables. It is particularly useful
when the number of variables is greater than the number
of individuals in the data set. SVM is based on a
model with $N$ ordered pairs $(y_i, x_i)$, where $y_i$ is a binary
outcome with a vertex $-1$ assigned to one group and $+1$
to the other, and $x_i = (x_{ij})$, $j = 1, 2, \ldots, M$, is a vector with
$M$ predictors.

If "$\cdot$" denotes the dot product and "$\hat{\ }$" the parameter
estimate, SVM constructs two hyperplanes in space,
$H_1: x_i \cdot w + b = -1$ and $H_2: x_i \cdot w + b = +1$, in which the
weights $w$ and the offset $b$ are estimated to maximize
the separation of $2/\|w\|$ between $H_1$ and $H_2$, with the
constraint $y_i (x_i \cdot w + b) - 1 \geq 0$, $\forall i$ (i.e., all of the
observations of two groups are separated by the two hyperplanes).
The optimization is equivalent to minimizing
$L_P = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y_i (x_i \cdot w + b)\right)$ with respect to
$w$ and $b$, where $\alpha_i \geq 0$ are Lagrange multipliers. Geometrically,
$\hat{w}$ is a function of the $N_s$ support vectors, those with nonzero
$\alpha_i$, which lie on the margins $H_1$ and $H_2$, and the
solution of $\hat{w} \cdot x_i$ is given by $\sum_{s=1}^{N_s} \hat{\alpha}_s y_s \, x_s \cdot x_i$.
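One way to see why only the support vectors enter the solution, sketched here under the standard hard-margin setup rather than taken from the original derivation, is to set the derivatives of $L_P$ with respect to $w$ and $b$ to zero:

$$
\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0
\;\;\Rightarrow\;\;
\hat{w} = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i ,
\qquad
\frac{\partial L_P}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0 .
$$

Observations with $\hat{\alpha}_i = 0$ drop out of the first sum, so $\hat{w}$ depends only on the $N_s$ observations with nonzero multipliers, and taking the dot product with $x_i$ gives $\hat{w} \cdot x_i = \sum_{s=1}^{N_s} \hat{\alpha}_s y_s \, x_s \cdot x_i$.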

SVM provides the advantage of allowing $M$ to be greater than $N$
because the solution that estimates $w$ is based on the
support vectors. An additional advantage is the relaxation
of linear mapping by using a kernel function $K$ that
corresponds to a nonlinear function $\varphi$ such that $x_i$
is replaced by $\varphi(x_i)$ and $x_s \cdot x_i$ is replaced by
$K(x_s, x_i) = \varphi(x_s) \cdot \varphi(x_i)$. Using a Gaussian kernel with a
scale, $\sigma^2$, the dot product $\hat{w} \cdot \varphi(x_i)$ is expressed as
$\sum_{s} \hat{\alpha}_s y_s \exp\left(-\|x_s - x_i\|^2 / 2\sigma^2\right)$. A penalty term (in general
denoted by $C$) is added for a generalization of the
optimal hyperplane when the data do not allow the two
groups to be completely separated, which limits the
Lagrange multipliers to range between 0 and $C$.
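The Gaussian-kernel expression above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB code: the function names, the toy support vectors, and the multipliers are all hypothetical, and the multipliers are taken as given rather than estimated.

```python
import math


def gaussian_kernel(x_s, x_i, sigma2=1.0):
    """K(x_s, x_i) = exp(-||x_s - x_i||^2 / (2 * sigma2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_s, x_i))
    return math.exp(-sq_dist / (2.0 * sigma2))


def decision_value(x_i, support_vectors, alphas, labels, b=0.0, sigma2=1.0):
    """Return sum_s alpha_s * y_s * K(x_s, x_i) + b for one observation."""
    return b + sum(
        alpha * y * gaussian_kernel(x_s, x_i, sigma2)
        for x_s, alpha, y in zip(support_vectors, alphas, labels)
    )


# Hypothetical values chosen only to exercise the formula.
support_vectors = [[0.5, -0.5], [-0.5, 0.5]]
alphas = [1.0, 1.0]   # assumed already estimated, each within [0, C]
labels = [+1, -1]
score = decision_value([0.5, -0.5], support_vectors, alphas, labels)
```

Because the score depends on the data only through kernel evaluations against the support vectors, the number of predictors M can exceed the number of observations N without changing the form of the computation.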

Adapting support vector machine for parent-child trios: 
Trio-SVM 

Trio-SVM analyzes a set of $N$ parent-child trios, in
which each child is described by a coordinate $(y_i, x_i)$ for
the $M$ RVs observed for that child and his or her parents,
and $y_i$ is $-1$ when the child is affected and $+1$ otherwise.
Here $x_i$ incorporates the conditional distribution of
RVs on parental genotypes using the framework of the
family-based association test (FBAT) [3]. At each RV
site $j$, $x_{ij}$ is the difference between the observed and
expected transmission of an RV to the child given the
two parental genotypes for that RV. Using this, each
child then gets a composite score for the test gene,
$y_i^{score}$, which is modeled by $\hat{w} \cdot \varphi(x_i) + \hat{b}$ and which
aggregates the RVs. The AUC (denoted by $\theta$) of $y^{score}$, tested under
$H_0: \theta \leq 0.5$ vs $H_a: \theta > 0.5$, is used to represent the composite
scores for distorted transmission within a gene over
the sample of trios comparing those who are affected
with those who are not. Accepting that $\theta$ is greater than
0.5 indicates the combined RVs in the gene are transmitted
with greater distortion from that which is
expected in the cases when compared with the control
participants. The test is one-sided because each group is
assigned to a fixed vertex; the statistic $\hat{\theta}/\mathrm{SE}(\hat{\theta})$ is asymptotically
Gaussian. For case-control analyses, one would
let $x_{ij}$ count the number of RVs at each site.
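The observed-minus-expected score at each site can be illustrated with a short Python sketch. It assumes an additive coding in which each genotype is the count (0, 1, or 2) of the rare allele and both parents are genotyped; the function names and the example trio are hypothetical and are not drawn from the GAW18 data or the authors' code.

```python
def expected_transmission(father, mother):
    """Expected rare-allele count in the child given parental counts (0, 1, or 2).

    Each parent passes on one of its two alleles with probability 1/2,
    so a parent carrying g copies transmits g / 2 copies on average.
    """
    return (father + mother) / 2.0


def transmission_scores(child, father, mother):
    """Observed minus expected rare-allele count at each site for one trio."""
    return [c - expected_transmission(f, m)
            for c, f, m in zip(child, father, mother)]


# Hypothetical trio genotyped at three RV sites (rare-allele counts).
child = [1, 0, 1]
father = [1, 0, 2]
mother = [0, 1, 0]
x_i = transmission_scores(child, father, mother)
```

At the third site both parents are homozygous, so the child's genotype is fully determined and the score is zero. Under this coding only sites with at least one heterozygous parent can yield a nonzero score, which is consistent with the statement that only heterozygous parents contribute to the transmission distortion data.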

Applying Trio-SVM to Genetic Analysis Workshop 18 
pedigree data 

For the GAW18 data, Trio-SVM was used to combine 
all observed RVs for a given gene, by selecting all 
affected and unaffected individuals having both parents 
in each pedigree and treating them as independent. A 
total of 275 such trios were derived from the 959 indivi- 
duals in 20 GAW18 pedigrees ascertained for type 2 
diabetes (T2D). The pedigree members were genotyped 
at 472,049 SNPs on GWAS platforms. Half of the sam- 
ple (n = 464) was sequenced at 8,348,674 sites, and 
imputation of nonsequenced individuals was performed 
using the GWAS data, thus providing each individual 
with a constellation of RVs. Blood pressures were taken 
longitudinally at 1 to 4 exams for 932 participants.
Hypertension was assigned based on systolic blood pres- 
sure (SBP) greater than 140 mm Hg, diastolic blood 
pressure (DBP) greater than 90 mm Hg, or use of anti- 
hypertensive medications. 
Trio-SVM accepted the input of the GAW18 pedigree
data in linkage format, and the noninformative sites in 
which no RVs were observed were removed. All trios with 
two parents available were gleaned from the pedigrees and 
treated as if they were independent for these analyses. How- 
ever, because they are not independent and LD reaches 
much greater distances in pedigrees than in independent 
samples, significant results with Trio-SVM may lead to 
false positives in such pedigrees. Specifically, for a disorder, 
if there is a causal common variant, its haplotype will segre- 
gate with the disorder throughout the pedigree. Any RVs 
that are on the haplotype in the pedigrees will be carried 
along with it, and genes that happen to have many RVs on 
that haplotype will be implicated. If the RVs are not in the 
causal gene, a type I error regarding association will occur. 

Analyses were focused on chromosome 3, as suggested
by the GAW18 organizers. Two T2D GWAS candidate
genes, ADCY5 (3q21.1) [4] and UBE2E2 (3p24.2) [5],
located on different arms of chromosome 3, were tested using
Trio-SVM. For comparison, SVM without transmission
information was used to analyze 108 founders consisting
of 67 cases and 41 control participants.

Trio-SVM type I and type II error rates using the 
simulated pedigree data 

Two hundred replicates of the genotyped data in the 
GAW18 pedigrees with the trait simulated under specific 
genetic models were available for assessments of type I and 
II statistical errors. The genes on chromosome 3 that were 
predisposing and nonpredisposing in the simulated models 
were tested. RVs were included in the analysis when their 
frequencies were less than 0.01 and less than 0.03 in two 
separate assessments. These analyses were performed on 
the simulated trait, hypertension, defined in two ways: (a) 
adjusted by age, age × gender, gender, and use of antihypertensive
medications and (b) not adjusted. Covariates
were included using linear mixed models with a random 
effect to account for the intrapedigree correlation. The 
traits were adjusted to age 38, no medications, and male 
gender. Power analyses were based on evaluating the pre- 
disposing gene MAP4, and the type I errors were assessed 
for the nonpredisposing gene, ARL13B (93.8 Mb), located 
between RYBP (72.5Mb) and B4GALT4 (118.9Mb), where 
both influenced DBP or SBP. To evaluate the power in a 
larger sample size, 500 and 1000 trios were drawn from the 
200 replicates using bootstrap sampling. 
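The bootstrap step can be sketched as sampling trios with replacement from a pool built across replicates. How the trios were pooled is not described in the text, so the pooling scheme, the identifiers, and the function name below are assumptions made only for illustration; the count of 267 trios per replicate is taken from Table 2.

```python
import random


def bootstrap_trios(pooled_trios, n_trios, seed=None):
    """Draw n_trios trios with replacement from the pooled list."""
    rng = random.Random(seed)
    return rng.choices(pooled_trios, k=n_trios)


# Hypothetical pool: (replicate label, trio index) pairs standing in for
# the trios of each of the 200 simulated replicates.
pooled = [("rep%03d" % rep, trio) for rep in range(1, 201) for trio in range(267)]

sample_500 = bootstrap_trios(pooled, 500, seed=1)
sample_1000 = bootstrap_trios(pooled, 1000, seed=1)
```

Sampling with replacement means the same trio can appear more than once in a bootstrap sample, so the resulting samples are larger than any single replicate but are not independent draws from a new population.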

Trio-SVM was evaluated using a Gaussian kernel ($\sigma^2$
fixed at 1) and 5-fold cross-validation for model selection
across different values of $C$, from 1 to 10.
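The model selection step can be outlined as a generic 5-fold search over candidate penalties. The sketch below is not the authors' implementation: the integer grid 1 to 10, the fold construction, and the `cv_score` callback (which would fit an SVM on the training indices and return a validation score such as the AUC) are all assumptions introduced for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def select_penalty(n_samples, candidates, cv_score, k=5, seed=0):
    """Return the candidate C with the highest mean validation score.

    cv_score(C, train_idx, test_idx) is a user-supplied (hypothetical)
    function that trains on train_idx and scores on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    best_c, best_mean = None, float("-inf")
    for c in candidates:
        scores = []
        for f in range(k):
            test_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(cv_score(c, train_idx, test_idx))
        mean_score = sum(scores) / k
        if mean_score > best_mean:
            best_c, best_mean = c, mean_score
    return best_c


# Dummy scorer used only to show the call pattern; it peaks at C = 4.
chosen = select_penalty(275, range(1, 11), lambda c, tr, te: -abs(c - 4))
```

In practice the callback would wrap whichever SVM solver is in use, with the kernel scale held fixed at 1 as described above.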

Results and discussion 

Trio-SVM analysis of type 2 diabetes candidate genes 

Table 1 reports the results of the two candidate genes
tested for association in the GAW18 pedigrees with Trio-SVM.
Both show association with hypertension, with
p-values of 3.2E-04 for ADCY5 at 3q21.1 and 0.035 for
UBE2E2 at 3p24.2. For this analysis, genes were selected
on both arms of chromosome 3 because we wanted to see
if we could detect independent signals given the poor resolution
caused by LD in pedigrees. Association in the set of
108 founders, in which 58 independent cases were compared
with 50 independent control participants, was not
detected. However, this limited sample provides very low
power. A significant p-value for the AUC statistic has two
possible interpretations. Either a consistent set of SNPs is
responsible for the signals in a gene or different variants
are contributing to the signals in the different pedigrees
[6]. However, disentangling these with the large number
of variants contributing to these signals is not
straightforward.

Table 1 Trio-SVM analyses of candidate genes in GAW18
trios and founders

                             Trios (n = 275)          Founders (n = 108)
                             66 case trios            50 case founders
                             209 control trios        58 control founders
Gene (#Bp)        #RV sites  AUC (SE)       p-Value   AUC (SE)       p-Value
ADCY5 (166,249)   426        0.637 (0.040)  3.2E-04   0.554 (0.056)  0.17
UBE2E2 (387,512)  917        0.575 (0.041)  0.035     0.539 (0.057)  0.25

AUC, the area under the curve; RV, rare variant; SE, standard error.
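As a rough arithmetic check, the trio p-values in Table 1 can be approximated from the reported AUCs and standard errors. The sketch below assumes that the one-sided test centers the AUC at its null value of 0.5 before dividing by the standard error; this centering is our assumption, since the Methods give the statistic only as the ratio of the AUC estimate to its standard error. The function name is hypothetical.

```python
import math


def one_sided_p(auc, se, null=0.5):
    """Upper-tail Gaussian p-value for H0: AUC <= null."""
    z = (auc - null) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# AUC and SE values for the trio analysis, as reported in Table 1.
table1_trios = {"ADCY5": (0.637, 0.040), "UBE2E2": (0.575, 0.041)}
p_values = {gene: one_sided_p(auc, se) for gene, (auc, se) in table1_trios.items()}
```

Under this assumption the computed values fall close to the reported 3.2E-04 and 0.035; small differences are expected because the table entries are rounded.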

Trio-SVM: power and type I error 

Table 2 summarizes the assessments of power and type
I error. In the adjustment, we used the averages of the slope
estimates over the 200 linear mixed models fitted to the 200
replicates. The adjustment increased the power. Of the sites
with minor allele frequencies (MAFs)
less than 0.03, 9 individual variants explained a total
variance of 1.73% to 2.04% in the DBP or SBP traits, whereas the
variant temp_323826 was removed from the set with MAFs
less than 0.01, which explained a total variance of 0.8% to
0.9%. The maximum power, 0.19, was achieved using
the adjusted simulated traits and sites with MAFs less than 0.03;
adding the 91 sites with 0.01 < MAF < 0.03 increased the power
by 0.05. The small increment
may be a result of the fact that the sites in LD
with the functional variant, temp_323826, were already
included in the group with MAFs less than 0.01. Subsequently,
we evaluated the power for MAFs less than 0.03
using larger samples generated by the bootstrap and
the adjusted simulated traits (Table 3). A power of 0.80 at
α = 0.05 was nearly reached using 500 trios, and the
power exceeded 0.80 at a more stringent α (0.0001)
using 1000 trios. Importantly, the type I error rates
were distributed around 0.05 and were not inflated by the
adjustment.
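Whether the observed type I error rates are compatible with the nominal 0.05 level can be gauged with a simple binomial calculation. The sketch below assumes that each rate in Table 2 is the proportion of significant tests among 200 independent replicates; the normal approximation and the function name are our own choices for illustration.

```python
import math


def binomial_interval(rate, n, z=1.96):
    """Approximate 95% range for an observed proportion when the true rate is `rate`."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / n)
    return rate - half_width, rate + half_width


low, high = binomial_interval(0.05, 200)

# Type I error rates reported in Table 2.
observed = [0.040, 0.055, 0.065, 0.065]
all_within = all(low <= r <= high for r in observed)
```

With 200 replicates the range is roughly 0.02 to 0.08, so rates between 0.040 and 0.065 are within the sampling variation expected for a test with a true type I error of 0.05.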

Table 2 Trio-SVM type I error and power in GAW18
simulated data (267 trios)

RV frequency  Trait adjusted  #RV sites  Power   #RV sites  Type I error (for p-values <0.05)
<0.03         No              405        0.15    115        0.040
              Yes                        0.19               0.055
<0.01         No              314        0.11    91         0.065
              Yes                        0.14               0.065

RV, rare variant.








Table 3 Trio-SVM power in multiple replicates

α¹                500 trios   1000 trios
p-Value <0.05     0.755       0.995
p-Value <0.01     0.610       0.980
p-Value <0.001    0.405       0.930
p-Value <0.0001   0.215       0.870

¹ Level of significance for the gene under analysis.



Conclusions 

Applications of machine learning methods to genomic
data are just beginning [7-9]. Using SVM, we developed
a novel approach for the analysis of RVs that handles
high-dimensional genomic data, relaxes the assumption of a linear
relationship between $(y_i, x_i)$, and controls for population stratification.
One disadvantage is that the magnitude of $w$ cannot be
explicitly expressed when using a nonlinear kernel. Importantly,
we can detect the association between RVs and a
test trait when applying Trio-SVM to a sample composed
of nuclear families. Our future work is to increase
the power by considering other newly defined kernel
functions, such as the wavelet transform, and to make the
extension a viable option in our code. The MATLAB
code of Trio-SVM can be obtained from the authors.